Abstract

Segmenting semantic objects from images and parsing 
them into their respective semantic parts are fundamental 
steps towards detailed object understanding in computer vi¬ 
sion. In this paper, we propose a joint solution that tack¬ 
les semantic object and part segmentation simultaneously, 
in which higher object-level context is provided to guide 
part segmentation, and more detailed part-level localiza¬ 
tion is utilized to refine object segmentation. Specifically, 
we first introduce the concept of semantic compositional 
parts (SCP) in which similar semantic parts are grouped 
and shared among different objects. A two-channel fully 
convolutional network (FCN) is then trained to provide the 
SCP and object potentials at each pixel. At the same time, a 
compact set of segments can also be obtained from the SCP 
predictions of the network. Given the potentials and the 
generated segments, in order to explore long-range context, 
we finally construct an efficient fully connected conditional 
random field (FCRF) to jointly predict the final object and 
part labels. Extensive evaluation on three different datasets 
shows that our approach can mutually enhance the perfor¬ 
mance of object and part segmentation, and outperforms the 
current state-of-the-art by a large margin on both tasks. 

1. Introduction 

Decomposing an object into semantic parts enables a 
more detailed understanding of the object, which can pro¬ 
vide additional information to benefit many computer vision 
tasks such as pose estimation [42, 11], detection [3, 8], seg¬ 
mentation [1 ], and fine-grained recognition [4 ]. Thus, it 
has become an attractive research topic to leverage seman¬ 
tic part representation through part detection [5, 16, 8], and 
human joint estimation [36]. 

In the literature of semantic segmentation, while object- 
level segmentation over multiple object categories has 
been extensively studied along with the growing popular¬ 
ity of standard evaluation benchmarks such as PASCAL 
VOC [15], object parsing (i.e., segmenting objects into se¬ 
mantic parts) is addressed mostly for a few specific cat¬ 
egories provided with accurate localization such as hu¬ 
man [45, 13, 11] and cars [1 ]. 

Figure 1. We handle the prediction of semantic object and part
segmentation in a wild scene scenario. (a) Original image. (b) and (c)
are the object and part segmentations, respectively, generated by
our algorithm.

With the increasing availability of semantic part annotations [8],
more recent works have attempted to handle
more difficult classes like animals with homogeneous ap¬ 
pearance [39], and to perform both object and part segmen¬ 
tation [21], as illustrated in Fig. 1. However, in [21], ob¬ 
ject and part segmentation are performed sequentially, in 
which the object mask is first segmented, and then the part 
labels are assigned to the pixels within the mask. As a re¬ 
sult, the errors from the predicted semantic object masks 
may be propagated to the parts. 

In fact, object and part segmentation are complemen¬ 
tary and mutually beneficial to each other. Semantic ob¬ 
ject segmentation requires a larger receptive field in order 
to correctly recognize the object, while part segmentation 
focuses on local details to obtain more accurate segmenta¬ 
tion boundaries and accommodate large pose and viewpoint 
variations. If these two tasks are tackled simultaneously, by 
integrating the object-level guidance with part-level detailed 
segmentation, we can address two of the most challenging 
problems at the same time, i.e., discovering the subtle ap¬ 
pearance differences between different parts within a single 
object, and avoiding the ambiguity across similar object categories.
Motivated by this observation, we propose a joint
solution to object and part segmentation, in which the consistency
between the object and its parts is enforced through joint
training and inference.

Fig. 2 shows the overall framework of our approach. 
When performing part segmentation over multiple object 
classes, the appearance of some parts may be very simi¬ 
lar, e.g., horse legs and cow legs. Therefore in order to re¬ 
duce the ambiguity and complexity during training, instead 
of treating each semantic part type independently [21], we 
allow some part labels to be shared by related object classes, 
and group the labelled parts of different classes into a semantic
compositional part (SCP) representation based on
their appearance and shape similarity (e.g., horse legs and
cow legs belonging to the same type of leg). Details of part
sharing and SCP labels are illustrated in Fig. 3. Since we
perform object and part segmentation jointly, the ambiguity
of shared part labels can be resolved by the object label.

Figure 2. Our framework for joint part and object segmentation. Given an image, a two-channel FCN is applied to predict both the
semantic compositional part (SCP) and object potentials with different input image scales. The two potentials are then concatenated and
fed to a new convolutional layer to predict joint object potentials. Finally, from the SCP potentials, SCP segments are proposed as nodes for
the fully-connected CRF to jointly infer the part and object labels.

Given an image with certain objects, we train a two- 
channel fully convolutional network (FCN) with the first 
channel predicting the SCP potentials while the second pre¬ 
dicting the object potentials. We then concatenate the two 
potentials as the input to an additional convolutional layer 
to refine the object potentials through joint training. At the
same time, a compact set of SCP region proposals is generated
from the SCP potentials. Using these region proposals
as nodes, we construct a fully connected conditional random
field (FCRF) to further incorporate the object and SCP
potentials, yielding the jointly predicted object and part
segmentations. In our FCRF, the consistency between the object
and its parts is enforced with long-range constraints.

We conducted extensive experiments on three datasets derived
from the PASCAL VOC segmentation benchmark [14, 8].
Experimental results demonstrate that our joint approach
improves both object and part segmentation, and significantly
outperforms the state of the art on both tasks.

2. Related work 

In the detection literature, the usefulness of mining semantic
part representations to help object recognition has
long been studied. Felzenszwalb et al. [16, 19] proposed the
deformable part-based model (DPM), which is an implicit
way of discovering hidden parts. Later models use more
accurate part representations through explicit part supervision
[34, 46]. Poselets [5] were proposed to model local
human parts through 2D projections from 3D data, and
can be used as a robust representation for both detection [20]
and segmentation [29, ]. In addition to human body parts,
Azizpour et al. [3] explicitly introduce bounding-box annotations
of animal parts, yielding stronger detection results
on both objects and parts. Chen et al. [8] extend such ideas
by providing richer labeling of part segments, which can
better capture the appearance features for learning.

In the segmentation literature, semantic parsing has
also been actively investigated. However, because obtaining
detailed boundaries is more challenging, most previous
work focused on parsing objects given both the semantic
category and a cropped bounding box with no occlusion,
such as human parsing [4, 45, 41, 13, ] ], car
parsing [35, 13, 2! ] or animal parsing [39]. Such methods
are limited in their applications, as objects in real-world
images are often occluded and exhibit large deformation
and appearance variations, which are difficult to handle
for shape-based [4] or appearance-based [11] models
with hand-crafted features or bottom-up segments.

Recently, deep convolutional neural networks [24] 
(CNN) have achieved great success in many applications 
such as object detection [17, 22, 47] and end-to-end seg¬ 
mentation [40, 12, 21, 27, 7], with advanced network struc¬ 
tures such as the VGG-Net [3- ]. Some studies tried to 
understand the implicitly learned filters [4. ] or compara¬ 
ble structures with the DPM [38, 1 ] in the network, and 
discovered some meaningful part representations in deeper 
layers. However, such representations are still not semantic 
parts, and using explicit supervision from semantic part la¬ 
bels for segmentation has not been investigated. In our ap¬ 
proach, we propose to explicitly model the semantic parts 
along with the whole object by taking advantage of recent
advances in fully convolutional networks (FCN) [21], which
have been very successful in predicting structured output
such as semantic object segmentation. Specifically, FCN
converts the fully-connected layers in the original CNN
to 1 × 1 convolution layers, and can thus efficiently perform
sliding-window-based classification at each pixel with a certain
receptive field. However, it starts from local convolutional
kernels with limited receptive fields and may not
be able to capture all the long-range context, yielding local
confusions. In our case, we address this problem by
modelling object-scale context, as also suggested
in prior work [5, 9, 37, 10].

Perhaps the closest work to our scenario is the hypercolumn
approach [21]. However, it performs object and part
segmentation sequentially, and trains many part classifiers
separately for each class, which increases the training cost
and scales poorly when the number of object classes is
large. In contrast, we use semantic compositional parts
(SCP) to allow part sharing and reduce training
complexity. Moreover, our model leverages the advantages
of both object segmentation and semantic part segmentation,
yielding strong results in very challenging scenarios.
To the best of our knowledge, this is the first work that provides
a joint solution to tackle the segmentation of semantic
parts and objects, which allows the part and object potentials
to interact and benefit each other.

3. Joint part and object segmentation 

Our framework includes four major parts, i.e., shared semantic
compositional parts (SCP) generation, part and object
potentials, proposal of SCP regions, and the fully connected
conditional random field (FCRF). In the following sections,
we describe these techniques in detail.

3.1. Semantic compositional parts 

When performing semantic object segmentation and 
parsing over multiple object classes, some parts from differ¬ 
ent object classes yet with similar semantic meanings may 
also have very similar shapes and appearances (e.g., horse
legs and cow legs). In such scenarios, allowing the parts to
be shared among similar object classes [46, 3 ] could al¬ 
leviate the difficulties of distinguishing similar parts from 
different objects, and at the same time reduce the increasing 
complexity of training and inference as the number of ob¬ 
ject categories grows. Therefore, before we formally train 
our framework for these two tasks, we group those similar 
parts to form the semantic compositional parts (SCP) that 
are shared among related object classes. 

Figure 3. Illustration of our semantic compositional part (SCP)
grammar in Sec. 3.1. Each SCP is associated with one semantic
meaning, and all the objects are composed of several SCPs.

In particular, a semantic part has an object label
$l_o$ (e.g., horse) and a particular semantic meaning $l_s$ (e.g.,
leg). The joint label of this part that we would like to infer is
denoted by $l_{op}$ (e.g., horse-leg). We group the original part
labels $l_{op}$ into a shared compositional part representation $l_{scp}$
if they have the same semantic meaning $l_s$ and highly similar
appearances and shapes. The SCPs $l_{scp}$ are then used
to compose different objects $l_o$, as illustrated in Fig. 3. For
example, in the case of horse and cow, the label sets
are $l_o \in$ {horse, cow}, $l_{op} \in$ {horse-head,
horse-body, horse-leg, horse-tail, cow-head, cow-leg,
cow-body, cow-tail} and $l_s \in$ {head, body, leg, tail}. If we
allow the two objects to share the same type of body, leg and
tail, we get the SCP labels $l_{scp} \in$ {head(h), head(c), body_1,
leg_1, tail_1}, as in Fig. 3, which is a much smaller prediction
space than that of $l_{op}$. The information of $l_{op}$ is kept in the
connections between $l_o$ and $l_{scp}$. During inference, by
enforcing the consistency of $l_{scp}$ and $l_o$, $l_{op}$ can be directly
recovered, e.g., given $l_o$ = horse and $l_{scp}$ = leg_1,
we get $l_{op}$ = horse-leg. Currently we group the part labels
manually. Nonetheless, automatically generating SCPs is a
very interesting problem, especially as the number
of object categories increases, and will be investigated in the future.
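
The label recovery described above can be sketched as a lookup in a small grammar table. This is only a minimal illustration, assuming the horse/cow grouping of Fig. 3; the names `SCP_GRAMMAR`, `is_consistent` and `recover_part_label` are hypothetical and do not come from the original implementation.

```python
# Hypothetical SCP grammar for the horse/cow example: each object class
# maps the SCP labels it is composed of to their semantic meaning.
SCP_GRAMMAR = {
    "horse": {"head(h)": "head", "body_1": "body", "leg_1": "leg", "tail_1": "tail"},
    "cow":   {"head(c)": "head", "body_1": "body", "leg_1": "leg", "tail_1": "tail"},
}


def is_consistent(obj_label, scp_label):
    """True if a connection exists between the object and the SCP."""
    return scp_label in SCP_GRAMMAR.get(obj_label, {})


def recover_part_label(obj_label, scp_label):
    """Recover the joint object-part label, or None if the pair is inconsistent."""
    if not is_consistent(obj_label, scp_label):
        return None
    return f"{obj_label}-{SCP_GRAMMAR[obj_label][scp_label]}"


print(recover_part_label("horse", "leg_1"))   # horse-leg
print(recover_part_label("cow", "head(h)"))   # None
```

With this grouping, the network only has to discriminate 5 SCP labels instead of 8 object-part labels; the object label disambiguates the shared ones.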

3.2. Deep part and object potentials 

In this section, we mainly describe the joint prediction of 
the semantic compositional parts (SCP) potentials and ob¬ 
ject potentials by a two-channel FCN, which is then used 
to construct the FCRF. Specifically, as illustrated in the
framework (Fig. 2), the first channel of the FCN predicts
an $(N_p + 1)$-channel SCP potential map ($N_p$ is the number
of SCP labels, while the additional label represents the
background). Similarly, the second channel of the network
predicts an $(N_o + 1)$-channel object-class potential map ($N_o$
being the number of object classes). In addition, we concatenate
the SCP potentials and the object potentials as a set of high- 
level features, and feed them to a new convolutional layer 
for object potential refinement. This predicted joint object 
potential has less noise within the object and better bound¬ 
aries, as the SCP potentials contain more fine-level details 
that can interact with the object potentials during joint train¬ 
ing. As shown in our experiments (Sec. 4.3), compared 
with using the original object potentials from FCN, our joint 
object potentials provide better evidence for the graphical 
model later, yielding better final results. 

One may also consider generating refined SCP potentials
similarly using the joint prediction layer. However, using
the SCP potentials in this way does not show much improvement
in our experiments. First, the SCP potentials already
encode more detailed boundary information than the object
potentials, so the interactions would not help the SCP
potentials refine their boundaries. Second, the ambiguity
of similar parts from different objects, which is the most
challenging problem in part segmentation, is already better
addressed by part sharing and the object-scale FCRF.
Therefore, to reduce system complexity, the joint prediction
for SCP refinement is not adopted in our framework.

Last but not least, the SCP potentials and object potentials
actually need different levels of context. This is also
a key factor in DPM [1 ], which uses a larger image
scale for part filters and a smaller scale for root filters. In
our case, we adopt a similar strategy with different input
resolutions to obtain proper receptive fields, i.e., $s_p \times s_p$ for
SCP and $s_o \times s_o$ for object, where we require $s_p > s_o$. We
investigated the influence of the input scale in our experiments,
and chose the optimal $s_p$ and $s_o$ by cross-validation.

3.3. SCP segments proposal 

As mentioned earlier, our FCRF is built upon a compact 
set of SCP segments. In our case, traditional object proposal 
algorithms such as CPMC [6] or MCG [2] would typically 
fail due to the subtle difference of appearance between con¬ 
nected parts such as the leg and body. Nevertheless, we 
can use the SCP FCN network to generate accurate seg¬ 
ments that are associated with SCP. Based on the predicted 
$(N_p + 1)$-channel SCP probability map, we assign the SCP
label with the highest probability to each pixel and generate 
the SCP label map. SCP segments are then generated by 
grouping the pixels with the same SCP labels. 
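
The proposal step above can be sketched in a few lines of plain Python. The text only states that pixels with the same SCP label are grouped; treating each 4-connected component as one segment is an assumption of this sketch, as are the function names and the nested-list data layout.

```python
from collections import deque


def scp_label_map(prob):
    """prob[c][y][x] is the probability of SCP label c (0 = background).
    Returns the per-pixel argmax label map."""
    n_labels, h, w = len(prob), len(prob[0]), len(prob[0][0])
    return [[max(range(n_labels), key=lambda c: prob[c][y][x])
             for x in range(w)] for y in range(h)]


def scp_segments(label_map):
    """Group 4-connected pixels that share a non-background label
    (the connectivity rule is an assumption, not stated in the text)."""
    h, w = len(label_map), len(label_map[0])
    seen = [[False] * w for _ in range(h)]
    segments = []
    for y in range(h):
        for x in range(w):
            if seen[y][x] or label_map[y][x] == 0:
                continue
            label, pixels, queue = label_map[y][x], [], deque([(y, x)])
            seen[y][x] = True
            while queue:  # breadth-first flood fill
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                            and label_map[ny][nx] == label):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            segments.append((label, pixels))
    return segments
```

Each returned segment is a `(label, pixels)` pair, which is the unit that later becomes a node of the FCRF.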

Fig. 4 shows several examples of the proposed SCP seg¬ 
ments from an image. We can see our SCP segments work 
very well in terms of capturing correct semantic part regions 
and locating the object boundaries. To evaluate the segments
quantitatively, we ran an oracle experiment by assigning each
proposed SCP segment the overlapping ground-truth
class label. The best possible object IOU is 85.1% on the
quadrupeds animal set in Sec. 4.3, which indicates that the
segments capture the object boundaries reasonably well. One might
also apply a local dense pixel-wise refinement strategy [7] to
further refine the segments, but this would increase the computational
cost and needs further investigation.

3.4. Joint FCRF 

FCN is essentially a sliding-window based approach, 
where its receptive field is always fixed to a local area given 
the input image, thus local confusion can hardly be avoided 
in many cases. Intuitively, the optimal receptive field should 
be close to the scale of the presented object in the image. 
Therefore, after generating SCP segments as well as SCP 
and object potentials from the joint FCN, we further con¬ 
struct a fully-connected CRF (FCRF) to automatically dis¬ 
cover the object-scale context, where all the parts of an ob¬ 
ject can interact with each other, yielding the optimal se¬ 
mantic object and part prediction simultaneously. 

Figure 4. The SCP segment proposals handle various difficulties
such as truncation, occlusion, deformation and viewpoint
changes. The color map for each kind of SCP is shown at the
bottom. We keep the colors consistent in the results.

Specifically, given the set of SCP segment proposals, we
first cluster the proposed SCP segments into several groups
according to their spatial distances, in case there are multiple
isolated objects in an image. Two SCP segments are
merged into the same group if the minimum distance between
the pixels of the two segments is smaller than a threshold
$t_s = 10$. We assume each group of SCP segments forms an
object or overlapping objects in the image, which provides
a good estimation of the object scale. Then, we build one
FCRF for each group. Formally, the FCRF
can be represented as $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$, where $\mathcal{V}$ is the set of
SCP segments in the same group and $\mathcal{E}$ is the set of edges
connecting every pair of segments. As introduced in our
framework (Fig. 2), for each SCP segment $P_i$, we want to
predict its semantic part label $l_{op}(P_i)$ (defined in Sec. 3.1),
which can be reduced to separately inferring the object label
$l_o(P_i)$ and the SCP label $l_{scp}(P_i)$ while enforcing their label
consistency. In the following, we use $l^i_{op}$ for $l_{op}(P_i)$, $l^i_p$ for
$l_{scp}(P_i)$ and $l^i_o$ for $l_o(P_i)$ for simplicity.
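
The grouping of segments into object-scale clusters can be sketched with a small union-find structure. This is a minimal illustration under stated assumptions: segments are lists of `(y, x)` pixels, the distance is Euclidean and measured in pixels, and the brute-force distance computation is chosen for clarity rather than speed. The function names are hypothetical.

```python
import math


def min_pixel_distance(seg_a, seg_b):
    """Smallest Euclidean distance between any pixel pair (brute force)."""
    return min(math.hypot(ay - by, ax - bx)
               for ay, ax in seg_a for by, bx in seg_b)


def group_segments(segments, t_s=10.0):
    """Cluster segments whose minimum pixel distance is below t_s.
    Returns a sorted list of groups, each a list of segment indices."""
    parent = list(range(len(segments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if min_pixel_distance(segments[i], segments[j]) < t_s:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(len(segments)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Merging is transitive: if segment A is close to B and B is close to C, all three end up in one group even when A and C are farther apart than the threshold, and one FCRF is then built per group.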

In sum, our FCRF is formulated as

$$\min_{\mathcal{L}} \sum_{i \in \mathcal{V}} \psi_i(l^i_{op}) + \lambda_e \sum_{(i,j) \in \mathcal{E}} \psi^e_{i,j}(l^i_{op}, l^j_{op}), \qquad (1)$$

where

$$\psi_i(l^i_{op}) = \psi^o_i(l^i_o) + \lambda_p \psi^p_i(l^i_p) + \psi^c_i(l^i_o, l^i_p),$$

$$\psi^e_{i,j}(l^i_{op}, l^j_{op}) = \psi^e_{i,j}(l^i_o, l^j_o, l^i_p, l^j_p),$$

where $\mathcal{L}$ is the label set of the proposed SCP segments,
and $\lambda_e$ and $\lambda_p$ are balancing parameters.
$\psi^o_i(l_o) = -\sum_{x_j \in P_i} \log P(l_o(x_j))$ is the sum of pixel-wise object
potentials inside the SCP segment $P_i$, and $\psi^p_i(l_p)$ is similarly
defined with the SCP potentials. $\psi^c_i(l_o, l_p)$ is a constraint
on the part and object combination, which is 1 if $(l^i_o, l^i_p)$
is a meaningful combination, i.e., a connection exists between
$l^i_o$ and $l^i_p$ in the grammar as in Fig. 3, and is set to $\infty$
otherwise.
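
The unary part of the energy can be illustrated for a single segment as follows. This is a sketch under assumptions: the balancing parameter is taken to weight the SCP term, the per-pixel probabilities are stored as one dictionary per pixel, and the grammar is a set of allowed (object, SCP) pairs. None of these data structures or names come from the original code.

```python
import math

INF = float("inf")


def unary_energy(obj_probs, scp_probs, l_o, l_p, grammar, lambda_p=0.3):
    """Unary energy of one SCP segment for the label pair (l_o, l_p).

    obj_probs / scp_probs: one dict per pixel of the segment, mapping
    label -> probability (hypothetical layout).
    grammar: set of allowed (object, SCP) label pairs.
    """
    if (l_o, l_p) not in grammar:
        return INF  # inconsistent object/part combination
    psi_o = -sum(math.log(p[l_o]) for p in obj_probs)  # object potential
    psi_p = -sum(math.log(p[l_p]) for p in scp_probs)  # SCP potential
    # The constraint term is 1 for a consistent pair, as stated in the text.
    return psi_o + lambda_p * psi_p + 1.0
```

Because inconsistent pairs receive infinite energy, any finite-energy labeling automatically respects the grammar, which is what allows the object-part label to be read off from the object and SCP labels.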

In order to learn the pairwise potentials, we train a two-layer
fully-connected neural network, which takes the features
from a pair of segments $i$ and $j$ as input, and predicts
the probabilities of the four labels $l^i_o$, $l^j_o$, $l^i_p$, $l^j_p$. The
ground-truth labels for each training segment are the most
dominant labels of the pixels within the segment, and we adopt
the multinomial logistic loss for training.

For the pairwise features, we consider multiple semantically
meaningful cues from the pairwise relations, i.e.,
$f_{ij} = [f_i^T, f_j^T, \kappa_{ij}^T, \theta_{j|i}]^T$, where $f_i = [f_{oi}^T, f_{pi}^T, f_{ai}]^T$ is a
segment self-descriptor, $\kappa_{ij} = [d^a_{ij}, d^s_{ij}]^T$ is a spatial metric of
the segment pair, and $\theta_{j|i}$ is the relative angle. We summarize
the features as follows,

• $f_{oi}$: the mean of pixel-wise object potentials.

• $f_{pi}$: the mean of pixel-wise SCP potentials.

• $f_{ai}$: the segment area, normalized by the discovered
object area.

• $d^a_{ij}$: the appearance geodesic distance from segment $i$ to
$j$, which accumulates the edge weights along the path
from $i$ to $j$ on the image.

• $d^s_{ij}$: the Euclidean distance between the centers of the
two segments, normalized by the height and width of
the object.

• $\theta_{j|i}$: the relative angle of the center of segment $j$
with respect to the center of segment $i$.

Specifically, for $d^a_{ij}$, the edge weight between two neighbouring
segments is the sum of the edgemap [25] values
over their shared boundary.
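
The two purely geometric cues in the list above, the normalized center distance and the relative angle, can be computed as in the following sketch. The exact normalization and angle convention are not given in the text, so dividing each axis offset by the object height or width and using `atan2` in image coordinates are assumptions; the function names are hypothetical.

```python
import math


def centroid(pixels):
    """Mean (y, x) position of a segment's pixels."""
    n = len(pixels)
    return (sum(y for y, _ in pixels) / n, sum(x for _, x in pixels) / n)


def spatial_features(seg_i, seg_j, obj_height, obj_width):
    """Normalized center distance and relative angle of j with respect to i."""
    yi, xi = centroid(seg_i)
    yj, xj = centroid(seg_j)
    dy = (yj - yi) / obj_height  # vertical offset, normalized by object height
    dx = (xj - xi) / obj_width   # horizontal offset, normalized by object width
    d_s = math.hypot(dy, dx)
    theta = math.atan2(yj - yi, xj - xi)  # radians, image coordinates (y down)
    return d_s, theta
```

Normalizing by the size of the discovered object makes the distance comparable between a small, distant animal and a large, close one.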

After the model is learned, given the pairwise feature $f_{ij}$,
the potential $\psi^e_{i,j}(l^i_o, l^j_o, l^i_p, l^j_p)$ is computed as the negative
log-likelihood, i.e., $-\log(P(l^i_o) P(l^j_o) P(l^i_p) P(l^j_p))$, where
the probabilities are from the neural network prediction.

For inference, since our graphical model has a very
small number of nodes (fewer than 15 SCP segments on
average), the efficient loopy belief propagation (LBP) [30]
converges very fast, within 5 iterations.
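
To make the objective concrete, the sketch below evaluates the pairwise term as a negative log-likelihood and minimizes the total energy by exhaustive search. Exhaustive search is a stand-in used here only for illustration: it is not the LBP inference used in the paper and is feasible only for a handful of nodes and labels. Summing over unordered pairs and all function names are assumptions of this sketch.

```python
import itertools
import math


def pairwise_energy(p_oi, p_oj, p_pi, p_pj):
    """Negative log-likelihood of the four predicted label probabilities."""
    return -math.log(p_oi * p_oj * p_pi * p_pj)


def brute_force_map(unary, pairwise, labels, lambda_e=2.0):
    """Exhaustively minimize
    sum_i unary[i][l_i] + lambda_e * sum_{i<j} pairwise(i, l_i, j, l_j).

    unary: list of dicts (label -> energy), one per node.
    pairwise: callable returning the energy of a labeled node pair.
    """
    n = len(unary)
    best, best_energy = None, math.inf
    for assignment in itertools.product(labels, repeat=n):
        energy = sum(unary[i][assignment[i]] for i in range(n))
        energy += lambda_e * sum(
            pairwise(i, assignment[i], j, assignment[j])
            for i in range(n) for j in range(i + 1, n))
        if energy < best_energy:
            best, best_energy = assignment, energy
    return best, best_energy
```

The search space grows exponentially with the number of segments, which is why an approximate message-passing method is the practical choice even for graphs of around 15 nodes.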

3.5. Relation to the DPM structure 

While we are dealing with semantic segmentation and
parsing, our approach follows the spirit of DPM [16] in object
detection. In [18], Girshick et al. connected the DPM
root filter and part filters with the convolutional filters from
a CNN, and the distance transform can be regarded as an additional
pooling and geometry filtering step. Our work addresses
the segmentation task but is analogous to
DPM: the object prediction can be considered as the
root filter and the SCP potentials as our learned part filters.
The difference is that rather than firing when the target is at
the exact center of the detection window, our model fires at
every location inside the target, which produces more accurate
location estimation. Regarding the part and root geometry,
rather than explicitly modeling the geometry of the root-part
distance transform as in DPM, we implicitly model
these spatial relationships using spatial distances and relative
angles as spatial features and learn the pairwise potentials
in the fully-connected graphical model, which is more
data-driven and generalizes better for handling variations.

For inference, DPM tries to maximize the probability of the
part-root locations $p$ given the object label $l_o$ through sliding
windows, i.e., $\max_{p} P(p \mid l_o)$. In our case, sliding-window
part and object localization is realized by our
FCN, based on which we can infer over the smaller label
space, i.e., $\max_{l_o \in \mathcal{L}} P(l_o \mid p)$, at the object-scale context.

4. Experiments 

In this section, we provide all the experimental de¬ 
tails, and evaluate our approach in terms of different ex¬ 
perimental settings to demonstrate the advantage of our 
approach. Specifically, we conduct experiments on the 
Horse-Cow parsing dataset introduced in [39], the PASCAL 
Quadrupeds dataset from [8], and our PASCAL Part benchmark,
over which extensive comparisons are performed.

4.1. Model training

Both of our FCN models for SCP and object prediction
are based on the 16-stride (16s) FCN, since an 8-stride (8s)
model brings only a trivial improvement (less than 1%, as
shown in our Tab. 2 and Tab. 2 of [27]) while requiring
significantly more training time.

Data augmentation. Effective data augmentation is
key to the success of FCN. In our case, we first crop out
each connected object mask using a randomly generated
bounding box around it. The crop is 1.3 times larger than the
object bounding box, which serves as a rough localization of the
object for generalization. Then, each cropped image is resized
to 300 × 300, and is further augmented
using the ideas from [12], i.e., we perform 4 additional random
crops of size 200 × 200, flipping, changing
the color intensity by a random scale in [0.7, 1.3] with a
probability of 0.4, and rotation in [-5, +5] degrees with a
probability of 0.5. On average, each image is augmented into
around 25 training samples.
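
The sampling of augmentation parameters can be sketched as below. Only the numeric ranges and the two probabilities come from the text; the flip probability of 0.5, the crop being centered on the object box, and all names are assumptions made for illustration.

```python
import random


def crop_box(obj_box, scale=1.3):
    """Enlarge an (x0, y0, x1, y1) object box about its center by `scale`.
    Centering the crop on the object is an assumption of this sketch."""
    x0, y0, x1, y1 = obj_box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half_w, half_h = scale * (x1 - x0) / 2, scale * (y1 - y0) / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def sample_augmentation(rng):
    """Draw one set of augmentation parameters for a 300x300 input."""
    params = {
        # top-left corner of a 200x200 crop inside the 300x300 image
        "crop_xy": (rng.randint(0, 100), rng.randint(0, 100)),
        "flip": rng.random() < 0.5,  # flip probability is an assumption
        "intensity_scale": 1.0,
        "rotation_deg": 0.0,
    }
    if rng.random() < 0.4:  # color intensity change, probability 0.4
        params["intensity_scale"] = rng.uniform(0.7, 1.3)
    if rng.random() < 0.5:  # small rotation, probability 0.5
        params["rotation_deg"] = rng.uniform(-5.0, 5.0)
    return params


rng = random.Random(0)  # fixed seed for reproducibility
samples = [sample_augmentation(rng) for _ in range(4)]  # 4 additional crops
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible across training runs.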

Optimization and step-wise training. We fine-tune all
our models step by step from the publicly available VGG-net
[27]. For training the SCP FCN, we first train a 32s
FCN with a learning rate of $10^{-4}$ for the final conv_fc8
layer, which is the layer predicting the SCP potentials, and
a learning rate of $10^{-5}$ for the layers after pool4, while we fix
the layers before pool4. Then, for the 16s FCN, we start
by training the conv_pool4 layer, which is used for predicting
SCP potentials from the output of pool4. We then use the
trained layers as initialization for the final 16s FCN training,
and further fine-tune the conv_fc8 and conv_pool4 layers with a
learning rate of $10^{-5}$. For training the joint object FCN, we
fix the SCP FCN and concatenate the SCP potentials with
the 16s object potentials, over which another convolutional
layer conv_jnt (with a kernel size of 5) is used to predict the
joint object potentials. For fine-tuning this model, we use
a learning rate of $10^{-4}$ for the conv_jnt layer and $10^{-6}$
for the conv_fc8 and conv_pool4 layers of the object FCN.
In all cases, we keep the batch size at 32. For the other
parameters, we use the values given by [27].

For the two-layer neural network used to learn the pairwise
potentials of the fully-connected CRF, we set 32 hidden
nodes, and use ReLU as the non-linearity with a dropout
rate of 0.2 to regularize the model. We use a batch size of
10000 and set the learning rate to $10^{-2}$. In learning the
pairwise term, rather than separately training on the SCP and
object labels, i.e., $l_{scp}$ and $l_o$, one may also consider outputting the
semantic part label $l_{op}$, which lies in the joint space of object
and part. Nevertheless, we found it harder to train, because
much more data is required for prediction in a high-dimensional
output space. Thus we chose to train separately.

All our neural networks are implemented on the Caffe platform
[23], partially based on the code provided by [2 1 ].

4.2. Parameters and details 

In the FCRF, we set $\lambda_e = 2$ and $\lambda_p = 0.3$, which are
validated on a validation set from the Quadrupeds dataset.
The same set is used for all other validation experiments.
For inference over the graphical model, we use the LBP
tool provided by Meltzer (see footnote 1).

Investigation of input image scale. The input image
scale for FCN is one of the most important factors for
achieving good performance, as also shown in recent
works [7, 26]. Fig. 5 shows our investigation of changing
the input image scale for SCP proposals and object potentials.
In the top row, we can see that the larger the image scale
is, the more accurate the object boundaries can be, but
the more local confusion arises, e.g., a leg of the cow starts
to be confused with that of a horse. Thus, we validate the scales
$s_p \in \{400, 500, 600\}$ and $s_o \in \{300, 400, 500\}$. At the
bottom of Fig. 5, we show the validation results; the optimal
combination, i.e., $s_p = 600$, $s_o = 300$, is used in all
our experiments.

Training and inference time. For training, we found that the
model converges after 40k iterations, which takes
around 2 days for the SCP FCN, 1 day for the object potentials
and 8 hours for the graphical model on a platform with a
4-core 3.2 GHz CPU and a K40 GPU. For inference, on average,
one image takes 0.3s for FCN forward propagation on the
GPU and 1.3s for the graphical model on our CPU.

4.3. Performance comparisons 

We compare our algorithm with three state-of-the-art 
methods on the two tasks. For semantic part segmenta¬ 
tion, we compare with the most recent compositional-based 

1 http://www.cs.huji.ac.il/ talyam/inference.html 

Figure 5. Investigation of image scales. Top: an example from our
joint object FCN prediction, showing that a larger scale leads to finer
boundaries but introduces local ambiguities. Bottom: validation
accuracy for the scale combinations of $s_o$ and $s_p$.

semantic part segmentation (SPS) [39] on the Horse-Cow
parsing dataset, and the hypercolumn (HC) [21] on all
datasets. For object segmentation, we compare our method
with the FCN [27]. We use the code provided by the authors
and fine-tune their model on our dataset, including
tuning a fixed optimal image scale for a fair comparison.
For comparison with HC [21], as no code is available
from the authors, we follow their part parsing strategy as our
baseline: we first predict a figure-ground mask with the trained FCN
8s [27], and then assign part labels inside the mask at an optimally
tuned image scale. For evaluation, we adopt the standard
intersection over union (IOU)
criterion for both tasks.
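
For reference, the IOU criterion for a single pair of label maps can be computed as in this sketch. Whether the counts are accumulated per image or over the whole test set is not stated in the text, so the single-pair form shown here, along with the flat-list input and the function name, is an assumption.

```python
def iou_per_class(pred, gt, n_classes):
    """Intersection over union for each class, given two flat label lists."""
    scores = []
    for c in range(n_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        # A class absent from both maps has no defined IOU.
        scores.append(inter / union if union else float("nan"))
    return scores


print(iou_per_class([0, 1, 1, 2], [0, 1, 2, 2], 3))  # [1.0, 0.5, 0.5]
```

The same function serves both tasks: for part segmentation the classes are part labels, and for object segmentation they are object labels.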

Horse-Cow parsing dataset. The Horse-Cow dataset is a
two-animal part segmentation benchmark proposed in [39].
For each animal class, the authors manually select the most
observable animal instances in both the train-val and test sets of
PASCAL VOC 2010 [15]. There are 294 training examples
and 277 test images. Since the dataset is not published yet, we
asked the authors for the dataset and their results. In [39],
the task is to segment the parts given the object class. Thus,
for a fair comparison, we also test every instance with the
object class known. In such a case, our graphical model in
Eqn. (1) is reduced to inferring the part label $l_p$ with a binary
object potential $l_o$.

Tab. 1 provides the results of SPS [39], HC [21] and
our method on semantic part and figure-ground segmentation.
Our method outperforms the previous state of the art
by a significant margin, on average 13% better than SPS
and 4.5% better than HC. Several qualitative examples are
shown in Fig. 6, where we keep the color map consistent
with the SCP labels (Fig. 4). In these cases, we can see that
SPS, built on explicit geometry rules, is limited in handling
large variance of object parts, which makes it difficult to
model highly deformable cases like the tail, or occluded animals.
For HC, due to inaccuracy of the object mask, it
may miss some detailed regions like legs (the 1st column of
horse). We will further compare with HC on later benchmarks.











Horse      Bkg    head   body   leg    tail   Fg     IOU    Pix. Acc
SPS [39]   79.14  47.64  69.74  38.85  -      68.63  -      81.45
HC [21]    85.71  57.30  77.88  51.93  37.10  78.84  61.98  87.18
Ours       87.34  60.02  77.52  58.35  51.88  80.70  65.02  88.49

Cow        Bkg    head   body   leg    tail   Fg     IOU    Pix. Acc
SPS [39]   78.00  40.55  61.65  36.32  -      71.98  -      78.97
HC [21]    81.86  55.18  72.75  42.03  11.04  77.07  52.57  84.43
Ours       85.68  58.04  76.04  51.12  15.00  82.63  57.18  87.00

Table 1. Average precision over the Horse-Cow dataset.


Figure 6. Comparison examples with SPS [39] and HC [21] 
on the Horse-Cow dataset. The color map is shown in Fig. 4 
(best viewed in color). 

Quadrupeds dataset. We further extend our experiments 
to the Quadrupeds part dataset, which contains five animal 
classes, i.e., cat, dog, sheep, cow and horse. In this task, 
we simultaneously predict the object and part masks. We 
use the data provided by [8], which include all the part 
labels in the PASCAL VOC 2010 training and validation images. 
We select the images containing the target objects, and treat 
the validation set as our test images, resulting in 3120 
training images and 294 testing images. In addition, since we 
focus more on semantic part segmentation, during 
testing we roughly localize the object by using the 
strategy from data augmentation (Sec. 4.1). We will provide 
our code for this localization to others for a fair comparison. 
It should be noted that such localization is still very coarse, 
with much looser bounding boxes than the ones needed in 
previous parsing methods [11, 39]. 

Following HC, the parts of all the quadrupeds are labelled 
as head, body, leg and tail, from which we construct 
a shared SCP grammar as introduced in Sec. 3.1. In 
our grammar, horse, cow and sheep share the same body 
and leg types, while cat and dog share another body and 
leg type. All the animals share a tail label, and each animal 
has its own head label since the head is highly 
distinguishable [32]. In total, 10 SCPs are used, while 20 
labels are needed if all the parts are treated independently. 
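
The label count above can be checked with a small sketch of the shared grammar. The label strings and the group names `hoofed` and `cat_dog` are hypothetical names chosen for illustration; only the sharing pattern itself is taken from the text.

```python
ANIMALS = ["cat", "dog", "sheep", "cow", "horse"]
PARTS = ["head", "body", "leg", "tail"]

# Horse, cow and sheep share one body type and one leg type;
# cat and dog share another pair. Group names are hypothetical.
HOOFED = {"horse", "cow", "sheep"}

def scp_label(animal, part):
    """Map an (object, part) pair to its shared SCP label."""
    if part == "head":
        return f"{animal}_head"      # heads are object-specific
    if part == "tail":
        return "tail"                # one tail label for all animals
    group = "hoofed" if animal in HOOFED else "cat_dog"
    return f"{group}_{part}"         # body and leg are shared within a group

scp_of = {(a, p): scp_label(a, p) for a in ANIMALS for p in PARTS}
scp_labels = sorted(set(scp_of.values()))

# Inverse map: which objects are compatible with each SCP.
objects_sharing = {}
for (animal, part), label in scp_of.items():
    objects_sharing.setdefault(label, set()).add(animal)

print(len(scp_of), "independent labels ->", len(scp_labels), "SCPs")
for label in scp_labels:
    print(label, sorted(objects_sharing[label]))
```

The inverse map makes explicit which SCPs leave the object class ambiguous (body, leg and tail) and which identify it directly (head).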

Tab. 2 shows the compared results on both object and 
part segmentation. As shown, in terms of object segmentation, 
compared with FCN [27], our final result improves by 
6.2%, which demonstrates the advantage of the joint model. 
We also compare variants of our approach with different 
components. “Joint FCN(16s)” produces the object masks 
directly from the two-channel FCN without FCRF inference. 
By including the SCP potentials, the object prediction 
of “Joint FCN(16s)” already outperforms FCN, which 
shows that the joint object potentials have less pixel-wise 
confusion. In addition, as shown by “FCRF+FCN(16s)”, 
using the FCRF with the FCN object potentials, without the 
joint convolutional layer, improves by over 2% compared to 
“FCN(16s)”, showing that FCRF inference can help refine 
segmentation results by exploring long-range context. 
Moreover, with the joint potentials, the performance of our full 
model gains another 4%. This shows that the joint FCN 
object potentials provide better evidence for our graphical 
model and are essential to our system. 
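
The gains quoted above can be recomputed from the mean IOU column of the object segmentation results in Tab. 2. The sketch below only restates that arithmetic; the dictionary keys are shorthand for the table rows, and pairing the 6.2% figure with the FCN 8s baseline is our reading of the numbers rather than something the text states.

```python
# Mean object IOU per method, copied from Tab. 2.
mean_iou = {
    "FCN(16s)": 72.04,
    "FCN(8s)": 72.47,
    "Joint FCN(16s)": 75.17,
    "FCRF+FCN(16s)": 74.57,
    "Ours final": 78.64,
}

def gain(method, baseline):
    """Absolute mean-IOU difference in percentage points."""
    return round(mean_iou[method] - mean_iou[baseline], 2)

full_vs_fcn8s = gain("Ours final", "FCN(8s)")           # about 6.2 points
joint_vs_fcn16s = gain("Joint FCN(16s)", "FCN(16s)")    # effect of the joint potentials alone
fcrf_vs_fcn16s = gain("FCRF+FCN(16s)", "FCN(16s)")      # "over 2%" from FCRF inference
full_vs_fcrf = gain("Ours final", "FCRF+FCN(16s)")      # "another 4%" from joint potentials

print(full_vs_fcn8s, joint_vs_fcn16s, fcrf_vs_fcn16s, full_vs_fcrf)
```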

We show several qualitative comparisons with 
the FCN on the left of Fig. 7, in which our algorithm is able 
to resolve the local ambiguities that FCN usually encounters. 
For instance, in the 4th row, the legs of a cow are confused 
with those of a horse by FCN. In contrast, by borrowing the 
object-scale evidence from the cow body and cow head, our model 
can correct this local confusion. In addition, as shown in the 
horse segments in the 3rd row, our SCP segments are able 
to provide more precise object boundaries in many difficult 
cases, like the legs of the horse crossing the bar. 



Object segmentation accuracy 

                  Bkg     Dog     Cat     Cow     Horse   Sheep   IOU     Pix. Acc 
FCN 16s [27]      93.25   74.30   78.62   61.88   56.56   67.63   72.04   93.00 
FCN 8s [27]       93.55   74.39   78.52   60.81   58.39   69.15   72.47   93.17 
Joint FCN(16s)    94.04   75.13   80.52   66.76   63.04   71.54   75.17   93.77 
FCRF+FCN(16s)     93.88   77.10   80.92   68.76   63.40   64.54   74.57   93.87 
Ours final        94.40   79.03   83.04   74.82   69.94   70.59   78.64   94.71 

Semantic part segmentation accuracy 

                  Bkg     Dog     Cat     Cow     Horse   Sheep   IOU     Pix. Acc 
HC [21]           92.83   42.07   43.99   35.49   38.59   33.80   41.36   89.54 
Ours final        94.46   45.63   47.81   42.7    49.60   35.74   46.69   91.74 

Table 2. Average precision over the Quadrupeds data. 





































[Figure 7. Object label color map: Dog, Cat, Cow, Horse, Sheep. Columns, left to right: Image, FCN-8s [27], Ours object, Object GT, HC [21], Our part, Part GT.] 

Figure 7. Comparison examples with FCN [27] for object segmentation and HC [21] for part segmentation on the Quadrupeds dataset. The 
object label color map is shown above and the part label color map is shown in Fig. 4 (best viewed in color). 


For semantic part segmentation, we summarize the 
mean IOU of all parts for each object in Tab. 2. Our 
results are also significantly higher than those of HC [21] 
(by over 5%). As HC performs object and part segmentation 
sequentially, the errors in object predictions, including 
local confusion and inaccurate boundaries, propagate to 
the parts. For example, in the 2nd row of Fig. 7, the cat 
and dog are confused in the object FCN prediction, which 
causes part of the cat head to be erroneously labelled as dog 
head in HC. We solve such problems through the FCRF by 
considering object-scale context. In addition, thanks to our 
optimized image scale for both object and part, our method can 
capture part boundaries that are sometimes missed by the 
FCN and HC. 

PASCAL part segmentation benchmark. In addition to 
the labels from the train-validation set of [8], following the 
same hierarchical part labelling system, we labelled 
semantic parts over the PASCAL VOC 2010 test 
set [14] of the object segmentation task, which includes 994 
images. As with the PASCAL VOC test benchmark, 
we will not release the labels and will instead launch 
an evaluation server for researchers to fairly compare their 
part segmentation results. 

We test our algorithm on the images containing the five 
quadrupeds, 281 images in total. As shown in Tab. 3, 
the results are consistent with those from the validation set, 
and our method achieves the best overall IOU, outperforming 
the state-of-the-art by a large margin. 


5. Conclusion and discussion 

In this paper, we proposed a framework for jointly 
solving object and part segmentation. Our approach 
follows the spirit of DPM, and leverages the advantages of 
both sides, yielding state-of-the-art results. Recently, for 
object segmentation, there are other methods that provide 
additional improvement over the FCN such as adding pixel- 
wise Dense CRF [7, 2t\. Our framework is complementary 
to them in that it models parts and adapts to 
object scale to resolve local ambiguity. In addition, we 
tackle part segmentation beyond the object. Possible 
failure cases are strong appearance confusion and 
strong occlusion, where the object scale or SCP segments can 
be misled, yielding inaccurate results (e.g., the back leg of 
the dog in the first example of Fig. 7). 

In the future, we will try to automatically learn the SCPs, 
and jointly model instance segmentation to incorporate 
detection, which should provide a better object scale and 
help solve the remaining localization issue. 





Object segmentation accuracy 

                  Bkg     Dog     Cat     Cow     Horse   Sheep   IOU     Pix. Acc 
FCN 8s [27]       94.45   70.14   75.45   64.06   64.75   69.06   72.99   93.90 
Ours final        95.31   77.44   80.47   72.13   76.18   67.96   78.25   95.26 

Semantic part segmentation accuracy 

                  Bkg     Dog     Cat     Cow     Horse   Sheep   IOU     Pix. Acc 
HC [21]           94.36   41.24   42.42   35.22   45.00   38.86   43.11   90.64 
Ours final        95.14   46.52   48.06   41.80   56.67   36.02   48.16   92.47 

Table 3. Average precision over the part segmentation benchmark. 
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